-
Notifications
You must be signed in to change notification settings - Fork 906
fix a memleak, close the patcher component; #13322
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: wjjahah <2457791952@qq.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I said in the original PR, I am not entirely sure it is always safe to unload the patcher components, especially the active one, from memory. Second, I don't think we test this scenario, as I don't know what BTL/MTL that are still actively tested still requires the patcher (UCX does not).
In my understanding, the opal_patcher_base_close function is the entry point to close the patcher. Since this is the entry point, the caller should ensure that it is safe. In addition, the patcher resources have been released at the beginning of this function, but the component has not been released, which may be a previous omission. |
For any other component that would be the normal behavior, but the memory patcher is different because it tracks memory management and as such is deeply integrated with the libc. Some techniques (aka most hook registrations) are seamless and can be added/removed as needed with litle impact, but some others (like the GOT table patching) are more fragile, and I think that once installed are better left active for the rest of the execution. This means that the module that installed them should never be unloaded, but that is impossible once you call the |
I traced the code and found that |
The code is unnecessarily convoluted, but that isn't entirely true. I think there's some minor risk with this patch. Given how many of the memory hooks we've removed over the years as patcher has gotten more trusted. Note that Libfabric has basically the same set of patcher code and does unload itself. Because of how NCCL + the OFI NCCL plugin works, we're unloaded the plugin during dlclose() in Libfabric every end of run. I think we should merge this and watch MTT for a couple days, but I have no real objection to the patch. |
Is the patcher tested with our MTT ? UCX does not use it anymore, and most of the BTLs don't need it either. So, we're basically left with libfabric with or without CUDA. How much is that codepath stressed ? |
This LGTM. I agree with @bwbarrett that this should be merged and MTT watched. |
patch_list has been destroy, but component not release